Introduction to Deep Reinforcement Learning (DRL)

Deep reinforcement learning (DRL) combines the ability of deep neural networks to represent high-dimensional data with the optimal-control framework of reinforcement learning. Unlike supervised or unsupervised learning, a DRL agent learns by trial and error: it interacts with a dynamic environment and makes sequential decisions without relying on explicit, immediate labels. This combination lets the agent work directly with complex raw data, such as image pixels.

1. The DRL Learning Paradigm

A reinforcement learning agent operates in a continuous loop: it observes the state of the environment ($S_t$), takes an action ($A_t$), and receives a scalar reward signal ($R_{t+1}$) that may be sparse or delayed. The core challenge is the credit assignment problem: determining which past actions were responsible for a reward signal that arrives later.
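The loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `CorridorEnv`, `random_policy`, and `run_episode` are hypothetical names invented here, not part of any library, and the environment is a toy corridor whose only nonzero reward sits at the far end, which makes the reward signal sparse.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment: cells 0..length-1, reward only at the right end."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action: 0 = move left, 1 = move right (clipped at the corridor walls)
        move = 1 if action == 1 else -1
        self.state = max(0, min(self.length - 1, self.state + move))
        done = self.state == self.length - 1
        reward = 1.0 if done else 0.0  # sparse: zero until the goal is reached
        return self.state, reward, done


def random_policy(state, rng):
    """Placeholder policy: ignores the state and picks an action uniformly."""
    return rng.choice([0, 1])


def run_episode(env, policy, rng, max_steps=1000):
    """Run one episode and record (S_t, A_t, R_{t+1}) for every step."""
    state = env.reset()
    trajectory = []
    for _ in range(max_steps):
        action = policy(state, rng)                  # choose A_t from S_t
        next_state, reward, done = env.step(action)  # environment returns S_{t+1}, R_{t+1}
        trajectory.append((state, action, reward))
        state = next_state
        if done:
            break
    return trajectory


rng = random.Random(0)
trajectory = run_episode(CorridorEnv(), random_policy, rng)
print(len(trajectory), trajectory[-1])
```

Every step before the last one carries a reward of zero, so the agent cannot tell from the reward alone which of its earlier moves helped. That is the credit assignment problem in miniature.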

2. The Optimization Objective

The ultimate goal is to find an optimal strategy, or policy ($\pi^*$): a mapping from states to actions that maximizes the expected discounted return ($G_t$). The discount factor ($\gamma \in [0, 1]$) is mathematically crucial. It determines how much weight we give to immediate rewards relative to rewards expected in the long run.

$$G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$
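As a minimal sketch of how this sum is evaluated in practice, the function below computes $G_t$ for a finite list of rewards. It assumes the episode ends after the last reward, so the infinite sum is truncated. The name `discounted_return` is chosen here for illustration only.

```python
def discounted_return(rewards, gamma):
    """Compute G_t for rewards [R_{t+1}, R_{t+2}, ...] of a finite episode."""
    g = 0.0
    # Work backwards using the recursion G_t = R_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


rewards = [1.0, 2.0, 4.0]
print(discounted_return(rewards, 0.5))  # 1 + 0.5*2 + 0.25*4 = 3.0
```

The backward pass avoids computing powers of $\gamma$ explicitly and returns the same value as the forward sum.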

The Fundamental RL Cycle

An illustration of the Markov Decision Process (MDP) framework. The Agent's policy dictates the action ($A_t$) based on the current state ($S_t$), leading the Environment to transition to a new state ($S_{t+1}$) and provide a reward ($R_{t+1}$).

The Reinforcement Learning Cycle: Agent, Environment, State, Action, Reward

Question 1

How does the DRL agent receive feedback from the environment?

Explicit labels/targets

Backpropagation through time

Scalar reward signal

Labeled demonstration data

Question 2

What does the policy ($\pi$) mathematically represent?

The predicted total reward

A distribution over actions given a state

The probability of transitioning to a new state

The error between predicted and actual returns

Challenge: The Discount Factor

Analyzing the Temporal Horizon.

Consider two scenarios:
1. $\gamma = 0$
2. $\gamma \approx 1$

Describe the agent's behavioral preference in each case regarding the timeline of rewards.

Step 1

How does the choice of $\gamma$ affect the policy's horizon?

Solution:
If $\gamma = 0$, the agent is myopic (shortsighted): only the immediate reward $R_{t+1}$ contributes to $G_t$. If $\gamma \approx 1$, the agent is far-sighted: it weights immediate and distant future rewards almost equally, which leads to planning over a very long horizon.
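A small numerical check can make the two cases concrete. The sketch below compares two made-up reward sequences, a small reward received immediately and a larger reward received four steps later, under $\gamma = 0$ and $\gamma = 0.99$. The sequences and the helper names (`discounted_return`, `preferred`) are assumptions introduced for illustration.

```python
def discounted_return(rewards, gamma):
    """Compute G_t for a finite list of rewards [R_{t+1}, R_{t+2}, ...]."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Hypothetical reward sequences for two candidate behaviours.
small_now = [1.0, 0.0, 0.0, 0.0, 0.0]     # reward of 1 on the first step
large_later = [0.0, 0.0, 0.0, 0.0, 10.0]  # reward of 10 on the fifth step


def preferred(gamma):
    """Return the name of the sequence with the higher discounted return."""
    g_now = discounted_return(small_now, gamma)
    g_later = discounted_return(large_later, gamma)
    return "small_now" if g_now > g_later else "large_later"


for gamma in (0.0, 0.99):
    print(f"gamma={gamma}: prefers {preferred(gamma)}")
```

With $\gamma = 0$ the delayed reward is multiplied by $0^4 = 0$ and contributes nothing, so the immediate reward wins. With $\gamma = 0.99$ the delayed reward keeps about 96% of its value ($0.99^4 \approx 0.96$) and dominates.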